Key Word(s): matplotlib, seaborn, plots, pandas
CS109A Introduction to Data Science
Lab 5: Exploratory Data Analysis, seaborn, more Plotting
Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Material Preparation: Eleni Kaxiras.
#RUN THIS CELL
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
# import the necessary libraries
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 200)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')
%config InlineBackend.figure_format ='retina'
In [3]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;
Learning Goals
By the end of this lab, you should be able to:
- implement the different types of plots, such as histograms, boxplots, etc., that were mentioned in class.
- use seaborn as well as matplotlib as part of your plotting toolbox.
This lab corresponds to lectures 6 through 9 and maps to homework 3.
1 - Visualization Inspiration

Notice that in “Summers Are Getting Hotter,” above, the histogram has intervals for global summer temperatures on the x-axis, designated from extremely cold to extremely hot, and their frequency on the y-axis.
That was an infographic intended for the general public. In contrast, take a look at the plots below of the same data, published in a scientific journal. They look quite different, don't they?
James Hansen, Makiko Sato, and Reto Ruedy, Perception of climate change. PNAS
2 - Implementing Various Types of Plots using matplotlib and seaborn
Before you start coding your visualization, you need to decide what type of visualization to use. A box plot, a histogram, a scatter plot, or something else? That will depend on the purpose of the plot (is it for inspecting your data during EDA, or for showing your results and conclusions to other people?) and on the number of variables that you want to plot.
You have a lot of tools for plotting in Python. The basic one, of course, is matplotlib. The seaborn library is built on top of it, and there are also libraries with their own rendering back ends, such as bokeh and altair.
In this class we will continue using matplotlib and also look into seaborn. Those two libraries are the ones you should be using for homework.
Introduction to seaborn
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. The library also provides a collection of example datasets for educational purposes, each of which can be loaded by calling:
seaborn.load_dataset(name, cache=True, data_home=None, **kws)
For information on what these datasets are, see https://github.com/mwaskom/seaborn-data
The plotting functions in seaborn can be divided into two categories
'axes-level' functions, such as regplot, boxplot, kdeplot, scatterplot, distplot which can connect with the matplotlib Axes object and its parameters. You can use that object as you would in matplotlib:
f, (ax1, ax2) = plt.subplots(2)
sns.regplot(x=x, y=y, ax=ax1)   # pass x and y by keyword
sns.kdeplot(x, ax=ax2)
ax1 = sns.distplot(x, kde=False, bins=20)
'figure-level' functions, such as lmplot, catplot (called factorplot before seaborn 0.9), jointplot, relplot, pairplot. In this case, seaborn organizes the resulting plot, which may include several Axes, in a meaningful way. That means that the functions need to have total control over the figure, so it isn't possible to plot, say, an lmplot onto a figure that already exists. Calling the function always initializes a figure and sets it up for the specific plot it's drawing. These functions return a grid object (a FacetGrid for lmplot, relplot, and catplot; a JointGrid for jointplot; a PairGrid for pairplot) with its own methods for operating on the resulting plot.
To set the parameters for figure-level functions:
sns.set_context("notebook", font_scale=1, rc={"lines.linewidth": 2.5})
The Titanic dataset
The titanic.csv file contains data for 887 passengers on the Titanic. Each row represents one person. The columns describe different attributes about the person including whether they survived, their age, their on-board class, their sex, and the fare they paid.
In [4]:
titanic = sns.load_dataset('titanic');
titanic.info();
In [5]:
titanic.columns
Exercise: Drop the following features: 'embarked', 'who', 'adult_male', 'embark_town', 'alive', 'alone'
In [6]:
# your code here
cols_to_drop = ['embarked', 'who', 'adult_male', 'embark_town', 'alive', 'alone']
titanic = titanic.drop(columns=cols_to_drop)
titanic
Exercise: Find how many passengers are missing deck information.
In [7]:
# your code here
missing_decks = titanic['deck'].isna().sum()  # number of rows where deck is missing
missing_decks
Histograms
Plotting one variable's distribution (categorical and continuous)
The most convenient way to take a quick look at a univariate distribution in seaborn is the distplot() function. By default, this will draw a histogram and fit a kernel density estimate (KDE).
A histogram displays a quantitative (numerical) distribution by showing the number (or percentage) of the data values that fall in specified intervals. The intervals are on the x-axis, and the number of values falling in each interval, shown as either a count or a percentage, is represented by a bar drawn above the corresponding interval.
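To make the definition concrete, here is a minimal sketch of the counting that a histogram performs, using numpy.histogram. The ages below are made up for illustration and are not taken from the Titanic data; the choice of four bins over the range 0 to 80 is likewise an assumption.

```python
import numpy as np

# Hypothetical ages, made up for illustration (not from the Titanic data)
ages = np.array([2, 4, 19, 22, 25, 31, 38, 40, 57, 71])

# Four equal-width intervals covering 0 to 80
counts, edges = np.histogram(ages, bins=4, range=(0, 80))

# Each interval includes its left edge; the last one also includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} values")

# Dividing by the total turns counts into percentages
percentages = 100 * counts / counts.sum()
```

The bar heights in the plot are exactly the values in `counts` (or in `percentages`, if the histogram is normalized that way).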
In [9]:
# What was the age distribution among passengers in the Titanic?
import seaborn as sns
sns.set(color_codes=True)
f, ax = plt.subplots(1,1, figsize=(8, 3));
ax = sns.distplot(titanic.age, kde=False, bins=20)
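# Caution: do not chain .set(...) onto the distplot call and assign the result to ax;
# .set() does not return the Axes, so the ax.set_ylabel call below would fail:
#ax = sns.distplot(titanic.age, kde=False, bins=20).set(xlim=(0, 90));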
ax.set(xlim=(0, 90));
ax.set_ylabel('counts');
In [10]:
f, ax = plt.subplots(1,1, figsize=(8, 3))
ax.hist(titanic.age, bins=20);
ax.set_xlim(0,90);
Exercise (pandas trick): Count all the infants on board (age less than 3) and all the children ages 3-10.
In [11]:
# your code here
infants = len(titanic[(titanic.age < 3)])
children = len(titanic[(titanic.age >= 3) & (titanic.age < 10)])
print(f'There were {infants} infants and {children} children on board the Titanic')
Pandas trick: We want to create virtual "bins" for readability and replace ranges of values with categories.
We will do this in an ad hoc way; it can be done better. For example, in the previous plot we could set:
(age<3) = 'infants',
(3,
(18
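One less ad hoc way to replace ranges of values with categories is pandas.cut. The sketch below is an illustration under stated assumptions: the ages are made up, and apart from 'infants' (age below 3) the bin edges and labels are hypothetical, since the list above is incomplete.

```python
import pandas as pd

# Hypothetical ages, made up for illustration
ages = pd.Series([1, 2, 5, 17, 18, 30, 64, 80], name="age")

# Assumed bin edges and labels; only 'infants' (age < 3) comes from the text above
edges = [0, 3, 18, 65, 120]
labels = ["infants", "children", "adults", "seniors"]

# right=False makes each interval include its left edge: [0, 3), [3, 18), ...
age_group = pd.cut(ages, bins=edges, labels=labels, right=False)

# Number of people in each category, listed in bin order
print(age_group.value_counts().sort_index())
```

The result is a categorical Series, so it can be passed directly to grouping or plotting functions that expect categories.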
See matplotlib colors here.
In [12]:
# set the colors
cmap = plt.get_cmap('Pastel1')
young = cmap(0.5)
middle = cmap(0.2)
older = cmap(0.8)
# get the object we will change - patches is an array with len: num of bins
fig, ax = plt.subplots()
y_values, bins, patches = ax.hist(titanic.age, 10)
for patch in patches[0:1]:    # bin 0
    patch.set_facecolor(young)
for patch in patches[1:3]:    # bins 1 and 2
    patch.set_facecolor(middle)
for patch in patches[3:10]:   # 7 remaining bins
    patch.set_facecolor(older)
ax.grid(True)
fig.show()
Kernel Density Estimation
The kernel density estimate can be a useful tool for plotting the shape of a distribution. The bandwidth (bw) parameter of the KDE controls how tightly the estimation is fit to the data, much like the bin size in a histogram. It corresponds to the width of the kernels we plotted above. The default behavior tries to guess a good value using a common reference rule, but it may be helpful to try larger or smaller values.
In [13]:
sns.kdeplot(titanic.age, bw=0.6, label="bw: 0.6", shade=True, color="r");
sns.kdeplot(titanic.age, bw=2, label="bw: 2", shade=True);
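To see why the bandwidth matters, a Gaussian KDE can be sketched directly in numpy. This is not seaborn's implementation, and seaborn may interpret its bw argument differently; here `bandwidth` is simply the standard deviation of each Gaussian kernel in data units. The helper name `gaussian_kde_sketch` and the synthetic two-peak sample are assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = distance from grid point i to sample j, in units of the bandwidth
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Synthetic sample with two clusters, made up for illustration
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(25, 3, 200), rng.normal(50, 3, 200)])
grid = np.linspace(0, 80, 801)

narrow = gaussian_kde_sketch(samples, grid, bandwidth=1.0)   # follows the data closely
wide = gaussian_kde_sketch(samples, grid, bandwidth=10.0)    # smooths the two peaks together
```

Plotting `narrow` and `wide` against `grid` gives one curve that hugs the two clusters and one that is flatter and smoother, the same trade-off as choosing small or large bins in a histogram.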
Exercise: Plot the distribution of fare paid by passengers
In [14]:
# your code here
sns.kdeplot(titanic.fare, bw=0.5, label="bw: 0.5", shade=True);
You can mix elements of matplotlib, such as Axes, with seaborn elements to get the best of both worlds.
In [15]:
import seaborn as sns
sns.set(color_codes=True)
x1 = np.random.normal(size=100)
x2 = np.random.normal(size=100)
fig, ax = plt.subplots(1,2, figsize=(15,5))
# seaborn goes in first subplot
sns.set(font_scale=0.5)
sns.distplot(x1, kde=False, bins=15, ax=ax[0]);
sns.distplot(x2, kde=False, bins=15, ax=ax[0]);
ax[0].set_title('seaborn Graph Here', fontsize=14)
ax[0].set_xlabel(r'$x$', fontsize=14)
ax[0].set_ylabel(r'$count$', fontsize=14)
# matplotlib goes in second subplot
ax[1].hist(x1, alpha=0.2, bins=15, label=r'$x1$');
ax[1].hist(x2, alpha=0.5, bins=15, label=r'$x2$');
ax[1].set_xlabel(r'$x$', fontsize=14)
ax[1].set_ylabel(r'$count$', fontsize=14)
ax[1].set_title('matplotlib Graph Here', fontsize=14)
ax[1].legend(loc='best', fontsize=14);
Introducing the heart disease dataset.
More on this in the in-class exercise at the end of the notebook.
In [16]:
columns = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
"thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
heart_df = pd.read_csv('../data/heart_disease.csv', header=None, names=columns)
heart_df.head()
In [17]:
# seaborn
ax = sns.boxplot(x='age', data=titanic)
#ax = sns.boxplot(x=titanic['age']) # another way to write this
ax.set_ylabel(None);
ax.set_xlabel('age', fontsize=14);
ax.set_title('Distribution of age in the Titanic', fontsize=14);
Two variables
Exercise: Did more young people or older ones get first class tickets on the Titanic?
In [18]:
# your code here
# two variables seaborn
ax = sns.boxplot(x="class", y="age", data=titanic)
In [19]:
# two variable boxplot in pandas
titanic.boxplot('age',by='class')